feat: DashboardHygieneAnalyzer (broken panels) by cicdteam · Pull Request #23 · remetric-dev/remetric

cicdteam · 2026-05-23T07:48:26Z

Summary

Adds the last v0.1 analyzer: flags Grafana dashboards whose panel queries reference Prometheus metrics that do not exist (not in head series, not in recording-rule outputs). One finding per (dashboard, missing-metric) pair, severity Medium.

Scope narrowed from spec §6.3: ships only broken-panel detection in v0.1. Untouched-dashboard detection (weak proxy without Grafana Enterprise meta.viewedAt) and near-duplicate detection (no canonical "panel signature" definition) are deferred per .claude/docs/superpowers/specs/2026-05-23-dashboard-sprawl-analyzer-design.md §11.

What's new

Analyzer: internal/analyzers/dashboardhygiene/ with happy-path detection, recording-rule resolution, VM-without-vmalert graceful-degrade, silent-skip for template-variable + non-prom datasources, and a fix-snippet builder.
CLI: remetric dashboards broken --prometheus <URL> --grafana <URL> (both flags required). Honors all standard flags, including the new --ignore-dashboard <regex>.
Wired into scan and report runner slices (no-Grafana → warning, parallel to unusedmetrics).
Types:
- Finding.Dashboard string field with omitempty.
- ignore.Patterns.Dashboard + matching --ignore-dashboard flag.
- Renamed ClassDashboardSprawl → ClassBrokenPanel, CategoryDashboardSprawl → CategoryDashboardHygiene (no live emitters of the old names).
Grafana client additive extensions: Dashboard.PanelTargets() (flat panel-title + expr pairs) and Client.BaseURL() (defensive copy).
promqlx fix: isSentinel switched from equality to substring containment. Catches concatenations like ${metric}_total → __remetric_var___total that previously leaked into the extracted metric set, polluting findings in both dashboardhygiene and unusedmetrics.
Docs: docs/findings/broken-panel.md replaces the dashboard-sprawl.md placeholder; mkdocs nav, the cross-link in unused-metric.md, README, and --help text are all updated.
E2E: e2e/dashboards_e2e_test.go provisions a broken-panel dashboard via file-based Grafana provisioning and asserts the finding.

Test plan

go test ./... -count=1 -race (20 packages, all PASS)
make fmt vet lint vuln (0 issues, no vulnerabilities)
make cover (total 86.1%, dashboardhygiene 85.1% — exceeds 75% floor + 80% target)
make e2e (all 8 e2e tests PASS including new TestE2E_DashboardsBroken_JSON)
CI green on this PR before merge

Commits

21 commits with per-task two-stage review (spec compliance → code quality). Each commit and fixup is independently buildable and tested. The history is squash-friendly, and bisect-friendly if anything regresses later.

…rename

Code-quality follow-ups to the happy-path commit:
- add Dashboard tiebreaker to the comparator so output is deterministic across map iterations
- extract the comparator into findingLess to keep Analyze under the gocyclo limit (was 15, the extra branch pushed it to 16)
- extract sample-cap 5 to a named const for grep-ability
- rename buildFinding parameter to mirror the call-site variable name

… test

Code-quality follow-ups to the recording-rule resolution commit:
- expand the two BuildInfo-adjacent comments to explain WHY each VM-flavor check exists (404 path vs 200-empty-groups path), so a future reader doesn't need to cross-reference unusedmetrics
- strengthen the RR test by querying AlertA from a second panel: proves the type filter (r.Type == "recording") actually held, not just that the recording-rule output was added to exists

Grafana template variables like ${metric}_total sanitise to __remetric_var___total - a valid PromQL identifier that parses cleanly and leaks into the extracted metric set. Change isSentinel to a Contains check so any name containing the sentinel substring is treated as a sanitiser artifact and filtered. Without this, dashboardhygiene and unusedmetrics would treat template-variable expressions as references to bogus metrics named '__remetric_var__*'.
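The difference between the two checks can be sketched as follows. This is a minimal illustration, not the promqlx source: the sentinel value is inferred from the `${metric}_total → __remetric_var___total` example above, and `isSentinelExact` is a hypothetical name for the old behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// sentinel is the placeholder substituted for a Grafana template variable.
// The value is inferred from the example in the commit message (assumption).
const sentinel = "__remetric_var__"

// isSentinelExact models the old check: only the bare placeholder is filtered.
func isSentinelExact(name string) bool { return name == sentinel }

// isSentinel models the new check: any name containing the placeholder is
// treated as a sanitiser artifact and filtered.
func isSentinel(name string) bool { return strings.Contains(name, sentinel) }

func main() {
	names := []string{"__remetric_var__", "__remetric_var___total", "http_requests_total"}
	for _, n := range names {
		fmt.Printf("%-24s exact=%-5v contains=%v\n", n, isSentinelExact(n), isSentinel(n))
	}
}
```

With the equality check, `__remetric_var___total` passes through as if it were a real metric name; with containment it is dropped, while ordinary names such as `http_requests_total` are unaffected.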

…and Loki targets

Grafana template-variable queries (${metric}_total) and non-Prometheus datasources must not generate findings or warnings. The template-variable path relies on promqlx filtering sentinel-derived metric names; the Loki path relies on PanelTargets filtering by datasource type.

… builtin

Code-quality follow-ups to the fix-snippet commit:
- pull the broken-panel docs URL from findings.DocURL(ClassBrokenPanel) instead of a hardcoded literal, so the single source of truth in internal/findings/ stays authoritative
- replace explicit limit-clamping with the Go 1.21+ min() builtin
- replace C-style index loop with idiomatic range over the slice

…pty.go

Code-quality follow-ups to the dashboards broken subcommand:
- drop the CLI's local re-sort; the analyzer already orders by (severity desc, sample-count desc, dashboard asc, metric asc), and the filter passes are stable. The local sort was discarding the sample-count tiebreaker, a meaningful signal that broken-from-many-panels metrics rank higher within a severity tier
- move brokenPanelCopy to empty.go for parity with cardinalityCopy and labelPatternCopy
- extend TestEmptyCopy_Values to cover brokenPanelCopy and the previously-uncovered unusedMetricsCopy
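The four-key ordering described above could look roughly like this. The `Finding` struct here is a hypothetical stand-in with only the fields the comparator needs; the real type, its severity representation, and how the sample count is stored are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// Finding is a reduced, hypothetical stand-in for the real finding type.
type Finding struct {
	Severity  int // assumed: higher value means more severe
	Samples   int // assumed: number of panels referencing the metric
	Dashboard string
	Metric    string
}

// findingLess orders by severity desc, sample-count desc, dashboard asc,
// metric asc, so the result is the same regardless of map iteration order.
func findingLess(a, b Finding) bool {
	if a.Severity != b.Severity {
		return a.Severity > b.Severity
	}
	if a.Samples != b.Samples {
		return a.Samples > b.Samples
	}
	if a.Dashboard != b.Dashboard {
		return a.Dashboard < b.Dashboard
	}
	return a.Metric < b.Metric
}

func sortFindings(fs []Finding) {
	sort.SliceStable(fs, func(i, j int) bool { return findingLess(fs[i], fs[j]) })
}

func main() {
	fs := []Finding{
		{Severity: 2, Samples: 1, Dashboard: "b", Metric: "m"},
		{Severity: 2, Samples: 3, Dashboard: "z", Metric: "m"},
		{Severity: 2, Samples: 1, Dashboard: "a", Metric: "m"},
	}
	sortFindings(fs)
	for _, f := range fs {
		fmt.Println(f.Dashboard, f.Samples)
	}
}
```

Because the comparator is a total order over these keys, a later stable filter pass preserves the ranking, which is why a second sort on fewer keys in the CLI would lose information.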

…l page

Real content for the broken-panel finding class. Updates the catalog, the unused-metric cross-link, the mkdocs nav, and the 'What's still missing in v0.1' README section since the analyzer now ships in v0.1.

cicdteam added 21 commits May 23, 2026 01:02

refactor(findings): rename dashboard-sprawl class/category to broken-…

6c2e107

…panel/dashboard-hygiene Scope narrows in v0.1 to only broken-panel detection; the old names had no live emitters.

feat(findings): add Finding.Dashboard field

ed524eb

Used by DashboardHygieneAnalyzer to carry the dashboard title. omitempty keeps the wire form clean for non-dashboard findings.
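As an illustration of what omitempty does to the wire form, here is a small sketch. The struct is a hypothetical reduction of the finding type, and the JSON key names are assumptions; only the omitempty behaviour itself is the point.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Finding is a hypothetical, reduced version of the real type.
// The JSON key names are assumptions made for this example.
type Finding struct {
	Class     string `json:"class"`
	Metric    string `json:"metric"`
	Dashboard string `json:"dashboard,omitempty"`
}

// encode marshals a finding, returning an empty string on error.
func encode(f Finding) string {
	b, err := json.Marshal(f)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Non-dashboard finding: the dashboard key is omitted entirely.
	fmt.Println(encode(Finding{Class: "unused-metric", Metric: "foo"}))
	// Dashboard finding: the key is present.
	fmt.Println(encode(Finding{Class: "broken-panel", Metric: "foo", Dashboard: "API Overview"}))
}
```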

feat(ignore): add Dashboard pattern + --ignore-dashboard flag

a9bd3f8

Anchored regex against Finding.Dashboard. Empty field never matches. Wires through config.IgnoreConfig.Dashboard.
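A minimal sketch of the matching rule, assuming anchoring is done by wrapping the user pattern in `^(?:…)$`. The helper names are hypothetical, and the real ignore package may anchor differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileAnchored wraps the pattern so it must match the whole title.
// Wrapping in ^(?:...)$ is an assumption about how anchoring is done.
func compileAnchored(pattern string) (*regexp.Regexp, error) {
	return regexp.Compile("^(?:" + pattern + ")$")
}

// ignoreDashboard reports whether a finding should be suppressed.
// An empty dashboard field never matches, even against ".*".
func ignoreDashboard(re *regexp.Regexp, dashboard string) bool {
	if re == nil || dashboard == "" {
		return false
	}
	return re.MatchString(dashboard)
}

func main() {
	re, err := compileAnchored("Legacy.*")
	if err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	for _, title := range []string{"Legacy API", "My Legacy API", ""} {
		fmt.Printf("%q ignored=%v\n", title, ignoreDashboard(re, title))
	}
}
```

The empty-field guard matters because findings from other analyzers carry no dashboard title, so a broad pattern must not silently suppress them.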

feat(grafana): add Dashboard.PanelTargets and Client.BaseURL

a2680ec

PanelTargets walks rows recursively, returns (panel-title, expr) pairs filtered to Prometheus targets. BaseURL is a defensive copy used by the dashboard-hygiene analyzer to build absolute dashboard URLs in Fix snippets.
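The recursive walk might be sketched like this. The types below are hypothetical simplifications of Grafana's dashboard JSON (row panels nesting child panels, each target carrying a datasource type), not the client's actual structs.

```go
package main

import "fmt"

// Target, Panel and PanelTarget are hypothetical simplifications of the
// dashboard model; field names are assumptions for this sketch.
type Target struct {
	Expr           string
	DatasourceType string
}

type Panel struct {
	Title   string
	Targets []Target
	Panels  []Panel // child panels when this panel is a row
}

type PanelTarget struct {
	PanelTitle string
	Expr       string
}

// collect flattens panels depth-first, keeping only Prometheus targets
// that actually carry an expression.
func collect(panels []Panel) []PanelTarget {
	var out []PanelTarget
	for _, p := range panels {
		for _, t := range p.Targets {
			if t.DatasourceType == "prometheus" && t.Expr != "" {
				out = append(out, PanelTarget{PanelTitle: p.Title, Expr: t.Expr})
			}
		}
		out = append(out, collect(p.Panels)...)
	}
	return out
}

func main() {
	dash := []Panel{{
		Title: "Row",
		Panels: []Panel{{
			Title: "Requests",
			Targets: []Target{
				{Expr: "rate(http_requests_total[5m])", DatasourceType: "prometheus"},
				{Expr: `{app="api"}`, DatasourceType: "loki"},
			},
		}},
	}}
	for _, pt := range collect(dash) {
		fmt.Println(pt.PanelTitle, "->", pt.Expr)
	}
}
```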

feat(analyzers): add dashboardhygiene skeleton

0b403a8

New analyzer flags Grafana dashboards whose panel queries reference missing Prometheus metrics. Skeleton + nil-Graf warning path; full algorithm in subsequent commits.

feat(analyzers): dashboardhygiene happy-path detection

e5023fa

Walks every Grafana dashboard, parses Prometheus targets via promqlx, and groups (dashboard, missing-metric) pairs. Severity Medium per the design. Recording-rule outputs and fix snippet come in later commits.

test(dashboardhygiene): pin grouping by (dashboard, missing-metric)

25b0934

Same missing metric across multiple panels yields one finding; distinct missing metrics in the same dashboard emit separate findings.
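The grouping rule pinned by this test can be sketched with a composite map key. All type and function names here are hypothetical; the sketch only shows why two panels referencing the same missing metric collapse into one entry.

```go
package main

import "fmt"

// pairKey identifies one finding: a (dashboard, missing-metric) pair.
type pairKey struct{ Dashboard, Metric string }

// ref is a hypothetical record of one metric reference in one panel.
type ref struct{ Dashboard, Panel, Metric string }

// groupMissing collects the panels that reference each missing metric,
// keyed by (dashboard, metric). Metrics present in exists are skipped.
func groupMissing(refs []ref, exists map[string]bool) map[pairKey][]string {
	out := make(map[pairKey][]string)
	for _, r := range refs {
		if exists[r.Metric] {
			continue
		}
		k := pairKey{Dashboard: r.Dashboard, Metric: r.Metric}
		out[k] = append(out[k], r.Panel)
	}
	return out
}

func main() {
	refs := []ref{
		{"API", "Latency", "gone_a"},
		{"API", "Errors", "gone_a"},
		{"API", "Errors", "gone_b"},
		{"API", "Traffic", "up"},
	}
	groups := groupMissing(refs, map[string]bool{"up": true})
	fmt.Println(len(groups), "findings")
}
```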

feat(dashboardhygiene): recording-rule outputs count as existing

eb26f17

A recording rule whose output is not yet in head series must still be treated as a known metric. Mirrors the resolution flow in unusedmetrics, including the VictoriaMetrics graceful-degrade sentinel.
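One way to picture the resolution step: build the set of known metrics from head series plus the outputs of rules whose type is "recording". The `Rule` struct is a hypothetical reduction of a rules API entry, and the VictoriaMetrics degrade path is left out of this sketch.

```go
package main

import "fmt"

// Rule is a hypothetical reduction of an entry from the rules API.
type Rule struct {
	Name string
	Type string // "recording" or "alerting"
}

// knownMetrics returns the set of metric names that count as existing:
// everything in head series plus every recording-rule output.
// Alerting rule names are deliberately not added.
func knownMetrics(head []string, rules []Rule) map[string]bool {
	exists := make(map[string]bool, len(head)+len(rules))
	for _, m := range head {
		exists[m] = true
	}
	for _, r := range rules {
		if r.Type == "recording" {
			exists[r.Name] = true
		}
	}
	return exists
}

func main() {
	exists := knownMetrics(
		[]string{"up"},
		[]Rule{{Name: "job:http_requests:rate5m", Type: "recording"}, {Name: "AlertA", Type: "alerting"}},
	)
	fmt.Println(exists["job:http_requests:rate5m"], exists["AlertA"])
}
```

A panel that queries the recording-rule output is therefore not flagged even before the first evaluation writes samples, while a panel that queries an alert name still is.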

test(dashboardhygiene): pin error surface (per-dashboard, search, VM)

f343081

Per-dashboard fetch errors degrade to warnings without aborting the analyzer. Search() failure is fatal. VictoriaMetrics without --vmalert emits the recording-rules-unavailable warning.
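The fatal-versus-warning split might be structured roughly as below. The function-valued parameters stand in for the Grafana client and are hypothetical, as are the message formats.

```go
package main

import (
	"errors"
	"fmt"
)

// errBoom is a placeholder error used by the demo below.
var errBoom = errors.New("boom")

// analyze lists dashboards and fetches each one. A search failure aborts
// the run; a per-dashboard failure is recorded as a warning and skipped.
// search and fetch are hypothetical stand-ins for Grafana client calls.
func analyze(search func() ([]string, error), fetch func(uid string) error) ([]string, error) {
	uids, err := search()
	if err != nil {
		return nil, fmt.Errorf("search dashboards: %w", err)
	}
	var warnings []string
	for _, uid := range uids {
		if ferr := fetch(uid); ferr != nil {
			warnings = append(warnings, fmt.Sprintf("dashboard %s: %v", uid, ferr))
			continue
		}
		// ... parse panel targets and collect findings here ...
	}
	return warnings, nil
}

func main() {
	warnings, err := analyze(
		func() ([]string, error) { return []string{"a", "b"}, nil },
		func(uid string) error {
			if uid == "a" {
				return errBoom
			}
			return nil
		},
	)
	fmt.Println(warnings, err)
}
```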

feat(dashboardhygiene): fix-snippet builder

168e332

Renders a paste-ready instruction block: restore the metric or remove the broken queries. Drops the URL line when no absolute dashboard URL is available. Caps the panel list at 10 entries with a '... and N more' tail.
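A sketch of the rendering rules described here: optional URL line, a cap of 10 panels, and a "... and N more" tail. The wording of the instruction lines and the function name are invented for illustration; only the cap, the tail format, and the URL-dropping rule come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// maxPanels is the cap on listed panels stated in the commit message.
const maxPanels = 10

// buildFixLines returns the snippet as lines. The instruction wording is
// hypothetical; the URL line is dropped when dashboardURL is empty.
func buildFixLines(metric, dashboardURL string, panels []string) []string {
	lines := []string{
		fmt.Sprintf("Metric %q does not exist in Prometheus.", metric),
		"Restore the metric or remove the broken queries.",
	}
	if dashboardURL != "" {
		lines = append(lines, "Dashboard: "+dashboardURL)
	}
	shown := panels
	if len(shown) > maxPanels {
		shown = shown[:maxPanels]
	}
	for _, p := range shown {
		lines = append(lines, "  - "+p)
	}
	if extra := len(panels) - len(shown); extra > 0 {
		lines = append(lines, fmt.Sprintf("  ... and %d more", extra))
	}
	return lines
}

func main() {
	panels := make([]string, 12)
	for i := range panels {
		panels[i] = fmt.Sprintf("Panel %d", i+1)
	}
	fmt.Println(strings.Join(buildFixLines("gone_total", "", panels), "\n"))
}
```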

feat(cli): add 'remetric dashboards broken' subcommand

ac9701f

New top-level subject 'dashboards' with one action 'broken'. Requires --prometheus and --grafana. Honors --output, --min-severity, --ignore-dashboard, --ignore-metric, --fail-on, --limit, --timeout.

feat(cli): wire dashboardhygiene into scan + report runners

6d16642

Both flows include the new analyzer; without --grafana it emits a warning and zero findings (consistent with unusedmetrics).

test(e2e): broken-panel scenario against demo stack

247b843

Provisions a Grafana dashboard whose only panel queries a metric Prometheus does not scrape; runs 'remetric dashboards broken' and asserts the finding is emitted with class=broken-panel.

docs: surface dashboardhygiene + --ignore-dashboard in README + help …

dc6b7c7

…text Follow-up to the analyzer landing. Updates user-facing copy that still listed the v0.1 analyzer set as four (now five with broken-panel) and omitted --ignore-dashboard from the ignore-* table.

cicdteam merged commit 717b7ad into main May 23, 2026
4 checks passed

cicdteam deleted the dashboard-hygiene branch May 23, 2026 07:51
